The Thomas Fire has lasted 14 days, ravaging Southern California north of Los Angeles, and has recently begun to bear down on affluent swaths of Santa Barbara and Montecito. According to the National Weather Service, a red flag warning remained in effect both in the mountains of Santa Barbara County and along the South Coast, with humidity dropping into the teens and wind gusts topping 55 mph overnight.
To find out what Internet users are saying about this disaster and how they feel about it, this report analyzes data collected from Twitter one week after the fire started, because the Twitter API only gives access to tweets from the most recent seven days. The report focuses on three parts:
1. The frequency of words mentioned by users, shown as word clouds.
2. Visualization of sentiment toward the fire across different hashtags and locations: whether people are more likely to complain or to encourage each other when facing a disaster. Besides static maps, there is also a Shiny application that creates an interactive map of tweet popularity.
3. Statistical analysis of whether retweet count is related to the sentiment score generated from tweet content, using shapiro.test to check normality. There is also a regression table testing whether location influences sentiment scores, i.e., whether people who live in the West have higher absolute sentiment scores than people who live in the East.
At first I tried to collect over 15,000 tweets from Twitter for three search terms using the searchTwitter() function: the two hashtags #Californiafire and #Californiawildfires and the keyword “California fire”. However, the Google geocoding API limits non-business use to 2,500 requests per day. Therefore, I reduced the total to 7,500 tweets, with 2,500 observations for each topic. After omitting NA values and restricting the scope to the US, I had 1288 observations for #Californiafire, 876 for “California fire” and 1256 for #Californiawildfires, 3420 observations in total.
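The filtering step described above can be sketched in base R. This is only an illustration under stated assumptions: the data frame, its column names `lon` and `lat`, and the rough bounding box for the contiguous US are all hypothetical, not the code used for this report.

```r
# Hypothetical geocoded tweets; the column names lon/lat are assumptions.
tweets <- data.frame(
  text = c("stay safe", "so sad", "thank you firefighters", "terrible news"),
  lon  = c(-119.7, NA, -74.0, 2.35),
  lat  = c(34.4, 40.7, 40.7, 48.86),
  stringsAsFactors = FALSE
)

# Rough bounding box for the contiguous US (an assumption for illustration).
in_us <- function(lon, lat) {
  !is.na(lon) & !is.na(lat) &
    lon >= -125 & lon <= -66 &
    lat >= 24 & lat <= 50
}

# Keep only tweets with known coordinates inside the box.
us_tweets <- tweets[in_us(tweets$lon, tweets$lat), ]
nrow(us_tweets)
```

A bounding box is a coarse filter; it would also keep points in parts of Canada and Mexico, so a real analysis might match on country names returned by the geocoder instead.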
To get a general understanding of the words people use most often in tweets to express their thoughts about the fire, below are the word clouds for each topic and for the total data frame.
In the word cloud for the total data frame, we can see both positive sentiment, such as "brave", and negative sentiment, such as "homeless" and "criminal". Interestingly, "trump" was also a popular word.
Word cloud for the hashtag #Californiafire
Word cloud for the keyword “California fire”
Word cloud for the hashtag #Californiawildfires
From the three word clouds above, we can see that the first, for #Californiafire, is approximately neutral, while the second shows that tweets under the keyword “California fire” were more likely to be negative, involving words like "illegal" and "criminal". In contrast, the third word cloud, for #Californiawildfires, shows that tweets under this hashtag are more positive, with words like "brave" and "bless".
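The word frequencies behind a word cloud can be sketched in base R as follows. The example texts and the tiny stop-word list are assumptions made up for illustration; a plotting package such as wordcloud could then draw the resulting table.

```r
# Hypothetical tweet texts, for illustration only.
texts <- c("Brave firefighters, thank you!",
           "So many homeless after the fire",
           "Brave crews fight the fire")

# A tiny illustrative stop-word list (an assumption, not a standard list).
stop_words <- c("the", "so", "many", "after", "you")

# Lowercase, replace anything that is not a letter or space, split on spaces.
clean <- gsub("[^a-z ]", " ", tolower(texts))
words <- unlist(strsplit(clean, " +"))
words <- words[nchar(words) > 0 & !(words %in% stop_words)]

# Frequency table, most common words first.
freq <- sort(table(words), decreasing = TRUE)
head(freq)
```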
Since the word clouds for the three data sets show different sentiment tendencies, it is hard to say whether the sentiment of the overall data is positive or negative. I therefore wrote a sentiment score function to calculate a score for every text. Text with negative sentiment has a negative score, and the higher the absolute value of the score, the stronger the sentiment. Below is the histogram for the overall data. We can see that the proportion of negative words is larger and that their sentiment is stronger, which means people are more likely to complain about the fire than to pray for those affected by it.
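A common way to build such a score is to count matches against lists of positive and negative words and take the difference. The sketch below assumes that approach; the function name `score_sentiment` and the two tiny lexicons are hypothetical and stand in for whatever function and word lists were actually used.

```r
# Tiny illustrative lexicons; a real analysis would use a full opinion lexicon.
pos_words <- c("brave", "bless", "safe", "thank")
neg_words <- c("homeless", "criminal", "illegal", "sad")

# Score = number of positive words minus number of negative words.
score_sentiment <- function(texts, pos, neg) {
  vapply(texts, function(txt) {
    clean <- gsub("[^a-z ]", " ", tolower(txt))
    words <- unlist(strsplit(clean, " +"))
    sum(words %in% pos) - sum(words %in% neg)
  }, numeric(1), USE.NAMES = FALSE)
}

scores <- score_sentiment(
  c("Bless the brave firefighters", "Homeless and sad", "Fire near Montecito"),
  pos_words, neg_words
)
scores       # signed score: direction of sentiment
abs(scores)  # absolute score: strength of sentiment
```

The absolute value corresponds to the `absolute_score` column that appears in the regression output later in this report.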
Having calculated sentiment scores, the question is whether there is a relationship between the intensity of sentiment and people’s location, i.e., whether people who live in the West, for example in California, have stronger sentiment than people on the east coast. To visualize this question, below are maps of the sentiment score for the total data and for the three sets separately. Here red represents positive sentiment and blue represents negative sentiment. There seem to be more users on the east coast than on the west coast. Also, the color of tweets from the west coast is darker, especially in Southern California, which means people there express more negative sentiment. The result also agrees with the word clouds: sentiment under the keyword “California fire” tends to be negative, while sentiment under the hashtag #Californiawildfires is very positive.
I created an interactive map using a Shiny application. As we have already seen the sentiment scores on the maps above, the interactive map focuses on the retweet count of each tweet as a measure of its popularity. In the map, the deeper the color of a point, the more popular the tweet is. Once you open the map, Shiny lets you explore the data according to your preferences. You can zoom in and out and click on individual points to find out more about a tweet, such as its sentiment score, user name and retweet count. You can also explore whether retweet count is related to the user's position. Compared with static ggmap maps, Shiny offers great opportunities for better data visualization and interaction. The application is available at: https://sabrina414.shinyapps.io/InteractiveMap/
Having visualized the data, here is some statistical analysis to support the findings above.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.00000 -1.00000 0.00000 -0.06316 0.00000 5.00000
From the summary of the scores we can see that the mean sentiment score is slightly negative (-0.063), with a minimum of -4 and a maximum of 5, while the median is 0. This suggests the overall sentiment leans negative, consistent with the conclusion above.
The histogram shows that the data are approximately normally distributed.
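A visual impression of normality can be complemented with the Shapiro-Wilk test mentioned in the introduction. The sketch below runs `shapiro.test` on simulated integer scores, which are an assumption standing in for the real sentiment scores.

```r
set.seed(1)

# Simulated integer-valued scores standing in for the real sentiment scores.
scores <- round(rnorm(500, mean = 0, sd = 1))

# Shapiro-Wilk test: the null hypothesis is that the data are normal.
result <- shapiro.test(scores)
result$statistic  # W statistic, close to 1 for normal-looking data
result$p.value    # a small p-value is evidence against normality
```

Because sentiment scores take only a few integer values, the test may reject normality even when the histogram looks bell-shaped, so the two checks are best read together.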
Below is the regression summary for the relationship between sentiment strength and retweet count. The p-value is below 0.05, so we reject the null hypothesis of no relationship at the 5% significance level. The conclusion is that the absolute sentiment score is associated with retweet count: the stronger the sentiment, the more retweets a tweet tends to receive. The R-squared of about 0.006, however, shows that sentiment strength explains very little of the variation in retweet count.
##
## Call:
## lm(formula = total$retweetCount ~ total$absolute_score)
##
## Residuals:
## Min 1Q Median 3Q Max
## -322.5 -129.3 -93.7 -53.0 10922.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.74 10.12 9.362 < 2e-16 ***
## total$absolute_score 46.16 10.37 4.452 8.79e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 462.9 on 3418 degrees of freedom
## Multiple R-squared: 0.005764, Adjusted R-squared: 0.005474
## F-statistic: 19.82 on 1 and 3418 DF, p-value: 8.793e-06
The smoothed line confirms the conclusion above. However, its dramatic trend may indicate a more specific relationship between retweet count and sentiment score, which needs further investigation.
Does location influence the strength of attitude? Both latitude and longitude are far from significant in the regression summary below (p = 0.497 and p = 0.154), which means that the strength of attitude is not clearly related to people’s location.
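The regression of retweet count on absolute sentiment score can be reproduced in outline with `lm`. The data below are simulated and the coefficients used to generate them are arbitrary assumptions; only the column names follow the output above.

```r
set.seed(42)
n <- 200

# Simulated data; the real analysis uses total$retweetCount
# and total$absolute_score.
sim <- data.frame(absolute_score = sample(0:4, n, replace = TRUE))
sim$retweetCount <- 90 + 45 * sim$absolute_score + rnorm(n, sd = 50)

# Simple linear regression of retweet count on sentiment strength.
fit <- lm(retweetCount ~ absolute_score, data = sim)
coefs <- summary(fit)$coefficients

coefs["absolute_score", c("Estimate", "Pr(>|t|)")]  # slope and its p-value
summary(fit)$r.squared                               # share of variance explained
```

Reading the slope's p-value together with R-squared helps separate "statistically significant" from "practically important".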
##
## Call:
## lm(formula = total$absolute_score ~ total$lat + total$lon)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6598 -0.6102 -0.5823 0.4008 4.4138
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6464250 0.1315383 4.914 9.33e-07 ***
## total$lat 0.0018418 0.0027113 0.679 0.497
## total$lon 0.0010699 0.0007512 1.424 0.154
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7634 on 3417 degrees of freedom
## Multiple R-squared: 0.0007787, Adjusted R-squared: 0.0001938
## F-statistic: 1.331 on 2 and 3417 DF, p-value: 0.2642
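The F-statistic on the last line of this output tests latitude and longitude jointly. One way to see what it measures is to compare the location model against an intercept-only model with `anova`. The sketch below uses simulated coordinates and scores, which are assumptions and not the report's data.

```r
set.seed(7)
n <- 300

# Simulated coordinates and scores with no built-in location effect.
sim <- data.frame(
  lat = runif(n, 25, 49),
  lon = runif(n, -124, -67),
  absolute_score = rpois(n, 0.6)
)

# Intercept-only model versus model with both location terms.
null_fit <- lm(absolute_score ~ 1, data = sim)
loc_fit  <- lm(absolute_score ~ lat + lon, data = sim)

# Joint F-test: do lat and lon together improve the fit?
cmp <- anova(null_fit, loc_fit)
cmp[2, c("Df", "F", "Pr(>F)")]
```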
From the analysis above, we can see that the sentiment of tweets differs among keywords and hashtags. Overall, the data set leans negative. The popularity of a tweet, represented by its retweet count, is also related to the strength of attitude. However, there may be duplicate texts across the three data sets, since the keywords and hashtags are quite similar. As a future improvement, these duplicates should be removed. It would also be better to work around the request limit of the Google geocoding API and gather more data. A larger data set would give this project more factual evidence.